Search | WHO COVID-19 Research Database

Using Haplotype-Based Artificial Intelligence to Evaluate SARS-CoV-2 Novel Variants and Mutations.

Zhao, Lue Ping; Cohen, Seth; Zhao, Michael; Madeleine, Margaret; Payne, Thomas H; Lybrand, Terry P; Geraghty, Daniel E; Jerome, Keith R; Corey, Lawrence.

JAMA Netw Open ; 6(2): e230191, 2023 02 01.

Article in English | MEDLINE | ID: covidwho-2288771

ABSTRACT

Importance: Earlier detection of emerging novel SARS-CoV-2 variants is important for public health surveillance of potential viral threats and for earlier prevention research. Artificial intelligence may facilitate early detection of SARS-CoV-2 emerging novel variants based on variant-specific mutation haplotypes and, in turn, be associated with enhanced implementation of risk-stratified public health prevention strategies. Objective: To develop a haplotype-based artificial intelligence (HAI) model for identifying novel variants, including mixture variants (MVs) of known variants and new variants with novel mutations. Design, Setting, and Participants: This cross-sectional study used serially observed viral genomic sequences globally (prior to March 14, 2022) to train and validate the HAI model and used it to identify variants arising from a prospective set of viruses from March 15 to May 18, 2022. Main Outcomes and Measures: Viral sequences, collection dates, and locations were subjected to statistical learning analysis to estimate variant-specific core mutations and haplotype frequencies, which were then used to construct an HAI model to identify novel variants. Results: Through training on more than 5 million viral sequences, an HAI model was built, and its identification performance was validated on an independent validation set of more than 5 million viruses. Its identification performance was assessed on a prospective set of 344,901 viruses. In addition to achieving an accuracy of 92.8% (95% CI within 0.1%), the HAI model identified 4 Omicron MVs (Omicron-Alpha, Omicron-Delta, Omicron-Epsilon, and Omicron-Zeta), 2 Delta MVs (Delta-Kappa and Delta-Zeta), and 1 Alpha-Epsilon MV, among which Omicron-Epsilon MVs were most frequent (609/657 MVs [92.7%]). Furthermore, the HAI model found that 1699 Omicron viruses had unidentifiable variants given that these variants acquired novel mutations. Lastly, 524 variant-unassigned and variant-unidentifiable viruses carried 16 novel mutations, 8 of which were increasing in prevalence percentages as of May 2022. Conclusions and Relevance: In this cross-sectional study, an HAI model found SARS-CoV-2 viruses with MV or novel mutations in the global population, which may require closer examination and monitoring. These results suggest that HAI may complement phylogenic variant assignment, providing additional insights into emerging novel variants in the population.
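
The "95% CI within 0.1%" figure quoted above can be sanity-checked with a normal-approximation (Wald) interval. The sketch below is only an illustration and not the authors' method: the abstract does not say which interval was used or on which set the accuracy was estimated, so the sample size of 344,901 (the prospective set) and the helper name wald_half_width are assumptions.

```python
from math import sqrt


def wald_half_width(p_hat, n, z=1.96):
    """Half-width of a normal-approximation 95% CI for a proportion.

    ASSUMPTION: a Wald interval; the abstract does not name the interval used.
    """
    return z * sqrt(p_hat * (1 - p_hat) / n)


# ASSUMPTION: accuracy of 92.8% estimated on the 344,901 prospective viruses.
half = wald_half_width(0.928, 344_901)
print(f"half-width = {100 * half:.3f} percentage points")
```

Under these assumptions the half-width comes out below 0.1 percentage points, which is consistent with the reported interval.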

Subject(s)

Artificial Intelligence , COVID-19 , Humans , Cross-Sectional Studies , Haplotypes , Prospective Studies , RNA, Viral , SARS-CoV-2 , Mutation

Rapidly identifying new coronavirus mutations of potential concern in the Omicron variant using an unsupervised learning strategy.

Zhao, Lue Ping; Lybrand, Terry P; Gilbert, Peter B; Payne, Thomas H; Pyo, Chul-Woo; Geraghty, Daniel E; Jerome, Keith R.

Sci Rep ; 12(1): 19089, 2022 Nov 09.

Article in English | MEDLINE | ID: covidwho-2106470

ABSTRACT

Extensive mutations in the Omicron spike protein appear to accelerate the transmission of SARS-CoV-2, and rapid infections increase the odds that additional mutants will emerge. To build an investigative framework, we have applied an unsupervised machine learning approach to 4296 Omicron viral genomes collected and deposited to GISAID as of December 14, 2021, and have identified a core haplotype of 28 polymutants (A67V, T95I, G339D, R346K, S371L, S373P, S375F, K417N, N440K, G446S, S477N, T478K, E484A, Q493R, G496S, Q498R, N501Y, Y505H, T547K, D614G, H655Y, N679K, P681H, N764K, K796Y, N856K, Q954H, N69K, L981F) in the spike protein and a separate core haplotype of 17 polymutants in non-spike genes: (K38, A1892) in nsp3, T492 in nsp4, (P132, V247, T280, S284) in 3C-like proteinase, I189 in nsp6, P323 in RNA-dependent RNA polymerase, I42 in Exonuclease, T9 in envelope protein, (D3, Q19, A63) in membrane glycoprotein, and (P13, R203, G204) in nucleocapsid phosphoprotein. Using these core haplotypes as reference, we have identified four newly emerging polymutants (R346, A701, I1081, N1192) in the spike protein (p value = 9.37 × 10^-4, 1.0 × 10^-15, 4.76 × 10^-7, and 1.56 × 10^-4, respectively), and five additional polymutants in non-spike genes (D343G in nucleocapsid phosphoprotein, V1069I in nsp3, V94A in nsp4, F694Y in the RNA-dependent RNA polymerase and L106L/F of ORF3a) that exhibit significant increasing trajectories (all p values < 1.0 × 10^-15). In the absence of relevant clinical data for these newly emerging mutations, it is important to monitor them closely. Two emerging mutations may be of particular concern: the N1192S mutation in spike protein is located in an extremely highly conserved region of all human coronaviruses that is integral to the viral fusion process, and the F694Y mutation in the RNA polymerase may induce conformational changes that could impact remdesivir binding.

Subject(s)

COVID-19 , Spike Glycoprotein, Coronavirus , Humans , Spike Glycoprotein, Coronavirus/genetics , Unsupervised Machine Learning , SARS-CoV-2/genetics , COVID-19/epidemiology , COVID-19/genetics , RNA-Dependent RNA Polymerase , Mutation , Phosphoproteins/genetics

Application of Statistical Learning to Identify Omicron Mutations in SARS-CoV-2 Viral Genome Sequence Data From Populations in Africa and the United States.

Zhao, Lue Ping; Lybrand, Terry P; Gilbert, Peter; Madeleine, Margaret; Payne, Thomas H; Cohen, Seth; Geraghty, Daniel E; Jerome, Keith R; Corey, Lawrence.

JAMA Netw Open ; 5(9): e2230293, 2022 09 01.

Article in English | MEDLINE | ID: covidwho-2013243

ABSTRACT

Importance: With timely collection of SARS-CoV-2 viral genome sequences, it is important to apply efficient data analytics to detect emerging variants at the earliest time. Objective: To evaluate the application of a statistical learning strategy (SLS) to improve early detection of novel SARS-CoV-2 variants using viral sequence data from global surveillance. Design, Setting, and Participants: This case series applied an SLS to viral genomic sequence data collected from 63,686 individuals in Africa and 531,827 individuals in the United States with SARS-CoV-2. Data were collected from January 1, 2020, to December 28, 2021. Main Outcomes and Measures: The outcome was an indicator of Omicron variant derived from viral sequences. Centering on a temporally collected outcome, the SLS used the generalized additive model to estimate locally averaged Omicron caseload percentages (OCPs) over time to characterize Omicron expansion and to estimate when OCP exceeded 10%, 25%, 50%, and 75% of the caseload. Additionally, an unsupervised learning technique was applied to visualize Omicron expansions, and temporal and spatial distributions of Omicron cases were investigated. Results: In total, there were 2698 cases of Omicron in Africa and 12,141 in the United States. The SLS found that Omicron was detectable in South Africa as early as December 31, 2020. With 10% OCP as a threshold, it may have been possible to declare Omicron a variant of concern as early as November 4, 2021, in South Africa. In the United States, the application of SLS suggested that the first case was detectable on November 21, 2021. Conclusions and Relevance: The application of SLS demonstrates how the Omicron variant may have emerged and expanded in Africa and the United States. Earlier detection could help the global effort in disease prevention and control. To optimize early detection, efficient data analytics, such as SLS, could assist in the rapid identification of new variants as soon as they emerge, with or without lineages designated, using viral sequence data from global surveillance.
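
The threshold step described in this abstract (finding when the locally averaged Omicron caseload percentage first exceeds 10%, 25%, 50%, and 75%) can be illustrated with a small sketch. It is not the study's implementation: a trailing 7-day moving average stands in for the generalized additive model, and the function name first_crossings, the window length, and the daily counts are all made up for illustration.

```python
from datetime import date, timedelta


def first_crossings(daily_counts, thresholds=(0.10, 0.25, 0.50, 0.75), window=7):
    """Return {threshold: first day the smoothed Omicron share exceeds it}.

    daily_counts: list of (day, omicron_n, total_n), sorted by day.
    ASSUMPTION: a trailing moving average replaces the GAM smoother
    named in the abstract.
    """
    crossings = {}
    for i, (day, _, _) in enumerate(daily_counts):
        recent = daily_counts[max(0, i - window + 1): i + 1]
        omicron = sum(o for _, o, _ in recent)
        total = sum(t for _, _, t in recent)
        if total == 0:
            continue  # no sequences in the window, share undefined
        share = omicron / total
        for th in thresholds:
            if th not in crossings and share > th:
                crossings[th] = day
    return crossings


# Hypothetical data: 40 sequences per day, Omicron count rising by 2 per day.
start = date(2021, 11, 1)
toy = [(start + timedelta(days=d), min(2 * d, 40), 40) for d in range(30)]
crossings = first_crossings(toy)
for th, day in sorted(crossings.items()):
    print(f"share first exceeds {th:.0%} on {day.isoformat()}")
```

A smoother such as a GAM would give less jagged estimates on sparse data, but the crossing logic stays the same: scan the smoothed series in date order and record the first day above each threshold.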

Subject(s)

COVID-19 , SARS-CoV-2 , COVID-19/epidemiology , Genome, Viral/genetics , Humans , Mutation , SARS-CoV-2/genetics , South Africa , United States/epidemiology

Mutations in viral nucleocapsid protein and endoRNase are discovered to associate with COVID19 hospitalization risk

Zhao, Lue Ping; Roychoudhury, Pavitra; Gilbert, Peter; Schiffer, Joshua; Lybrand, Terry P.; Payne, Thomas H.; Randhawa, April; Thiebaud, Sara; Mills, Margaret; Greninger, Alex; Pyo, Chul-Woo; Wang, Ruihan; Li, Renyu; Thomas, Alexander; Norris, Brandon; Nelson, Wyatt C.; Jerome, Keith R.; Geraghty, Daniel E.

Scientific reports ; 12(1), 2022.

Article in English | EuropePMC | ID: covidwho-1652406

ABSTRACT

SARS-CoV-2 is spreading worldwide with continuously evolving variants, some of which occur in the Spike protein and appear to increase viral transmissibility. However, variants that cause severe COVID-19 or lead to other breakthroughs have not been well characterized. To discover such viral variants, we assembled a cohort of 683 COVID-19 patients, 388 inpatients (“cases”) and 295 outpatients (“controls”), from April to August 2020 using electronically captured COVID test request forms and sequenced their viral genomes. To improve the analytical power, we accessed 7137 viral sequences in Washington State to filter out viral single nucleotide variants (SNVs) that did not have significant expansions over the collection period. Applying this filter led to the identification of 53 SNVs that were statistically significant, of which 13 SNVs each had 3 or more variant copies in the discovery cohort. Correlating these selected SNVs with case/control status, eight SNVs were found to significantly associate with inpatient status (q-values < 0.01). Using temporal synchrony, we identified a four-SNV haplotype (t19839-g28881-g28882-g28883) that was significantly associated with case/control status (Fisher’s exact p = 2.84 × 10^-11). This haplotype appeared in April 2020, peaked in June, and persisted into January 2021. The association was replicated (OR = 5.46, p-value = 4.71 × 10^-12) in an independent cohort of 964 COVID-19 patients (June 1, 2020 to March 31, 2021). The haplotype included a synonymous change N73N in endoRNase, and three non-synonymous changes coding residues R203K, R203S and G204R in the nucleocapsid protein. This discovery points to the potential functional role of the nucleocapsid protein in triggering “cytokine storms” and severe COVID-19 that led to hospitalization. The study further emphasizes a need for tracking and analyzing viral sequences in correlations with clinical status.
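
The case/control association test named in this abstract, Fisher's exact test on a 2x2 table of haplotype carriage by inpatient status, can be sketched in plain Python. The abstract does not report the underlying counts, so the table below is a made-up toy example, and the function name fisher_exact_two_sided is hypothetical. The two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: cases, controls. Columns: haplotype carriers, non-carriers.
    Returns (sample odds ratio, two-sided p-value).
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x carriers among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_obs * (1 + 1e-9))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Toy table (NOT study data): 3 of 4 cases and 1 of 4 controls carry the haplotype.
odds, p = fisher_exact_two_sided(3, 1, 1, 3)
print(f"odds ratio = {odds:.1f}, two-sided p = {p:.4f}")
```

With real cohort counts the same function applies unchanged, since Python integers keep the binomial coefficients exact at any cohort size.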

Tracking SARS-CoV-2 Spike Protein Mutations in the United States (January 2020-March 2021) Using a Statistical Learning Strategy.

Zhao, Lue Ping; Lybrand, Terry P; Gilbert, Peter B; Hawn, Thomas R; Schiffer, Joshua T; Stamatatos, Leonidas; Payne, Thomas H; Carpp, Lindsay N; Geraghty, Daniel E; Jerome, Keith R.

Viruses ; 14(1), 2021 12 21.

Article in English | MEDLINE | ID: covidwho-1580415

ABSTRACT

The emergence and establishment of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of interest (VOIs) and variants of concern (VOCs) highlight the importance of genomic surveillance. We propose a statistical learning strategy (SLS) for identifying and spatiotemporally tracking potentially relevant Spike protein mutations. We analyzed 167,893 Spike protein sequences from coronavirus disease 2019 (COVID-19) cases in the United States (excluding 21,391 sequences from VOI/VOC strains) deposited at GISAID from 19 January 2020 to 15 March 2021. Alignment against the reference Spike protein sequence led to the identification of viral residue variants (VRVs), i.e., residues harboring a substitution compared to the reference strain. Next, generalized additive models were applied to model VRV temporal dynamics and to identify VRVs with significant and substantial dynamics (false discovery rate q-value < 0.01; maximum VRV proportion >10% on at least one day). Unsupervised learning was then applied to hierarchically organize VRVs by spatiotemporal patterns and identify VRV-haplotypes. Finally, homology modeling was performed to gain insight into the potential impact of VRVs on Spike protein structure. We identified 90 VRVs, 71 of which had not previously been observed in a VOI/VOC, and 35 of which have emerged recently and are durably present. Our analysis identified 17 VRVs ~91 days earlier than their first corresponding VOI/VOC publication. Unsupervised learning revealed eight VRV-haplotypes of four VRVs or more, suggesting two emerging strains (B.1.1.222 and B.1.234). Structural modeling supported a potential functional impact of the D1118H and L452R mutations. The SLS approach equally monitors all Spike residues over time, independently of existing phylogenic classifications, and is complementary to existing genomic surveillance methods.
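
The selection rule stated in this abstract (false discovery rate q-value < 0.01 and a maximum VRV proportion above 10% on at least one day) can be sketched as a two-condition filter. The abstract does not say how q-values were computed, so the Benjamini-Hochberg adjustment used here is an assumption, and the function names, VRV labels, p-values, and proportions are all hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order.

    ASSUMPTION: the abstract only says "false discovery rate q-value";
    BH is one standard way to obtain such values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def select_vrvs(stats, q_cut=0.01, prop_cut=0.10):
    """stats: {name: (p_value, max_daily_proportion)} -> list of selected names."""
    names = list(stats)
    q = bh_qvalues([stats[n][0] for n in names])
    return [n for n, qv in zip(names, q)
            if qv < q_cut and stats[n][1] > prop_cut]


# Made-up inputs for illustration only.
toy_stats = {
    "VRV_A": (0.0001, 0.30),  # significant and substantial
    "VRV_B": (0.0002, 0.05),  # significant, but never above 10%
    "VRV_C": (0.03, 0.40),    # substantial, but q-value too large
    "VRV_D": (0.5, 0.20),     # neither
}
selected = select_vrvs(toy_stats)
print(selected)
```

Requiring both conditions keeps residues whose trend is statistically reliable and large enough to matter, which is the "significant and substantial" pairing the abstract describes.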

Subject(s)

COVID-19/virology , SARS-CoV-2/genetics , Spike Glycoprotein, Coronavirus/genetics , Amino Acid Sequence , COVID-19/epidemiology , Haplotypes , Humans , Models, Molecular , Models, Statistical , Mutation , SARS-CoV-2/classification , SARS-CoV-2/isolation & purification , Spatio-Temporal Analysis , Spike Glycoprotein, Coronavirus/chemistry , United States/epidemiology , Unsupervised Machine Learning
